Speech extraction from vibration signals based on deep learning

Extracting speech information from vibration response signals is a typical system identification problem, and traditional methods are overly sensitive to deviations in model parameters, noise, boundary conditions, and sensor position. This work proposes a method that obtains speech signals by collecting the vibration signals of a vibroacoustic system and using them for deep learning training. A vibroacoustic coupling finite element model was first established with the voice signal as the excitation source. The vibration acceleration signals at the vibration response point were used as the training set, from which spectral features were extracted. Two types of networks were trained: fully connected and convolutional. The fully connected network converged faster and produced better extracted speech quality. During testing, the amplitude spectra of the network output and the phase of the vibration signals were used to convert the extracted speech back to the time domain. The simulation results showed that the position of the vibration response point had little effect on the quality of speech extraction, and good extraction quality was obtained. Noise in the speech signals had a greater influence on the extraction quality than noise in the vibration signals, and the extracted speech quality was poor when both were noisy. The method was robust to deviations in the position of the vibration response point between training and testing. The lower the structural flexibility, the better the speech extraction quality. When the node mass was increased in the test set for an already trained system, the extraction quality decreased, but the difference was negligible. Changes in boundary conditions did not significantly affect the extracted speech quality. The proposed speech extraction model is therefore robust to position deviations, mass deviations, and changes in boundary conditions.


Introduction
The extraction of speech signals is the first step in speech processing. Sound data collected by acoustic signal acquisition hardware are generally waveform data. The dependence of speech (as a time-sequence signal) on the time dimension is also vital for speech denoising. A deep learning based integrated architecture called FuzzyGCP has been proposed by Garain et al. [15] for recognizing spoken language from speech signals. The architecture combines the classification principles of a deep dumb multilayer perceptron (DDMLP), a deep convolutional neural network (DCNN), and a semi-supervised generative adversarial network (SSGAN) to maximize accuracy, and finally uses the Choquet integral for ensemble learning to predict the final output. However, CNNs are not good at extracting time-sequence information. Therefore, researchers applied recurrent neural networks (RNNs), which are better at processing time-sequence information, to speech denoising. Tan et al. [16] proposed a speech-denoising model based on convolutional recurrent networks (CRNs). It denoises better by combining the strength of CNNs in extracting local information with that of LSTM in temporal modeling. Defossez et al. [17] combined a gating mechanism with the CRN, adding a gate to each CNN layer to filter noise. Since its inception, the RNN has evolved continuously, with LSTM addressing the previously difficult problem of back-propagation. The GRU, a simplified variant derived from LSTM, can also maintain a certain level of accuracy. However, as reported in a recent paper, temporal convolutional networks (TCNs) [18], a member of the CNN family, outperformed RNNs on various datasets and became a promising approach for analyzing sequence data. Tandale et al. [19] evaluated gated RNN models with LSTM and GRU units, followed by the TCN architecture, to develop an effective surrogate model that learns the midpoint deformation behavior of plates under complex, path-dependent shock-wave loading.
The Transformer [20] has also been applied to natural language processing and image processing, and Transformer-based speech-denoising models have appeared recently. Kim et al. [21] proposed a frequency-domain speech-denoising model based on the Transformer. It adapts the Transformer to speech denoising by adding Gaussian weighting to all self-attention layers so that nearby frames receive greater attention weight. Wang et al. [22] proposed a time-domain speech-denoising model based on a two-stage Transformer. The model extracts the local and global timing information of long speech sequences and achieves a good denoising effect. An improved Swin Transformer has been proposed for segmenting dense urban buildings from remote sensing images with complex backgrounds [23]. The original Swin Transformer was used as the backbone of the encoder, while the convolutional block attention module was used in the linear embedding and patch merging stages to focus on important features. The hierarchical feature maps were then fused to enhance feature extraction and fed into UPerNet (as the decoder) to obtain the final segmentation map. Collapsed and non-collapsed buildings were labeled in remote sensing images of the Yushu and Beichuan earthquakes. Data augmentation by horizontal and vertical flipping, brightness adjustment, and uniform and non-uniform fogging was performed to simulate actual conditions. The effectiveness and superiority of this method over the original Swin Transformer and several mature CNN-based segmentation models were verified through ablation experiments and comparative studies.
In conclusion, deep learning has been successfully applied in signal processing, image recognition, machine translation, speech recognition, emotion recognition, and other fields. However, no research has addressed the extraction of speech signals from vibration signals for cases in which acoustic signals are inconvenient to measure directly. The focus of this article is to extract speech signals from an acoustic-vibration coupling model and to verify the effectiveness of deep learning methods for such problems. Section 2 introduces the proposed model and method; two types of networks are applied to the task: fully connected and convolutional. Section 3 analyzes how the response point position, noise, position deviation, node mass, and boundary conditions affect the extracted speech quality. Section 4 summarizes the work.

Model and method
The main problem studied in this work is the extraction of sound signals from the vibration response.

Acoustic-structure response
A flat plate placed between infinitely large baffles was selected to simplify the analysis. The vibration of the plate was caused by the sound waves acting on it. The vibroacoustic response was obtained by the finite element method. The plate was modeled by shell elements with three translational and two rotational degrees of freedom per node [24]. According to classical plate-shell theory, the displacement field in the shell element can be expressed as follows:

$$u(x, y, z, t) = u_0(x, y, t) - z \frac{\partial w_0}{\partial x} \tag{1}$$

$$v(x, y, z, t) = v_0(x, y, t) - z \frac{\partial w_0}{\partial y} \tag{2}$$

$$w(x, y, z, t) = w_0(x, y, t) \tag{3}$$

where $u$, $v$, and $w$ are the displacements in the x, y, and z directions, respectively, while $u_0$, $v_0$, and $w_0$ are the displacements on the neutral surface. Its matrix form is given by Eq (4). The displacements on the neutral surface can be expressed by the element interpolation functions as follows:

$$u_0(x, y, t) = \sum_{i=1}^{4} u_{0i}(t)\, \psi_i(x, y) \tag{6}$$

$$v_0(x, y, t) = \sum_{i=1}^{4} v_{0i}(t)\, \psi_i(x, y) \tag{7}$$

$$w_0(x, y, t) = \sum_{i=1}^{4} \left[ w_{0i}(t)\, g_{i1}(x, y) + \frac{\partial w_{0i}(t)}{\partial x}\, g_{i2}(x, y) + \frac{\partial w_{0i}(t)}{\partial y}\, g_{i3}(x, y) \right] \tag{8}$$

where $\psi_i$ (i = 1, 2, 3, and 4) is the linear interpolation function, while $g_{ij}$ (j = 1, 2, and 3) is the non-conforming Hermite cubic interpolation function, as defined in Eqs (9)-(12). According to Eqs (6)-(12), Eq (4) can be written in the form of Eq (13). Substituting Eq (13) into the relation between strain and displacement expresses the strain in terms of $[B_s]$, which can be obtained by differentiating $[N_s]$. The mass matrix and stiffness matrix of the shell element can then be written from the virtual work principle.
where $J$ is the element Jacobian matrix, $[E_s]$ is the shell element stiffness matrix, and $V_e$ is the element volume. The element matrices are assembled into the overall system (Eq (21)), where $[M]$ is the overall mass matrix, $[K]$ is the overall stiffness matrix, and $f$ is the force vector.

Verification
Part of the Mozilla Common Voice dataset [25] is used to train and test the deep learning networks. The dataset contains 48-kHz recordings of subjects speaking short sentences. The clean audio signals are first downsampled to 8 kHz to reduce the computational load on the network, because most speech energy lies below 4 kHz. Then, the Newmark method is used to solve the dynamic response with a time step of 1/8000 s.
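The time integration step can be illustrated with a minimal sketch of the Newmark method. The source does not state the Newmark parameters, the damping model, or the initial conditions, so the average-acceleration variant (beta = 1/4, gamma = 1/2), an optional damping matrix `C`, and zero initial displacement and velocity are all assumptions here; the function name `newmark` is illustrative only.

```python
import numpy as np


def newmark(M, C, K, F, dt, beta=0.25, gamma=0.5):
    """Integrate M a + C v + K u = F(t) with the Newmark-beta scheme.

    F has shape (n_steps, n_dof). Returns displacement, velocity, and
    acceleration histories of the same shape.
    Assumption: zero initial displacement and velocity.
    """
    n_steps, n_dof = F.shape
    u = np.zeros((n_steps, n_dof))
    v = np.zeros((n_steps, n_dof))
    a = np.zeros((n_steps, n_dof))

    # Initial acceleration from equilibrium at t = 0.
    a[0] = np.linalg.solve(M, F[0] - C @ v[0] - K @ u[0])

    # Effective stiffness is constant for a linear system with fixed dt.
    K_eff = K + gamma / (beta * dt) * C + M / (beta * dt**2)

    for i in range(n_steps - 1):
        rhs = (F[i + 1]
               + M @ (u[i] / (beta * dt**2) + v[i] / (beta * dt)
                      + (0.5 / beta - 1.0) * a[i])
               + C @ (gamma / (beta * dt) * u[i] + (gamma / beta - 1.0) * v[i]
                      + dt * (0.5 * gamma / beta - 1.0) * a[i]))
        u[i + 1] = np.linalg.solve(K_eff, rhs)
        a[i + 1] = ((u[i + 1] - u[i]) / (beta * dt**2) - v[i] / (beta * dt)
                    - (0.5 / beta - 1.0) * a[i])
        v[i + 1] = v[i] + dt * ((1.0 - gamma) * a[i] + gamma * a[i + 1])
    return u, v, a
```

With `dt = 1/8000`, each row of `F` would hold the nodal forces derived from one speech sample, and the acceleration history `a` at the response node would play the role of the vibration signal. For a large FE model, factorizing `K_eff` once instead of calling `solve` at every step would be the natural refinement.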
The predictor and target signals are obtained by transforming the vibration response and the clean speech signals, respectively, into the frequency domain using the short-time Fourier transform (STFT) with a window length of 128 samples, an overlap of 75%, and a Hamming window. The size of the spectral vector is reduced to 65 by discarding the frequency samples corresponding to negative frequencies. Since the time-domain signals are real, this causes no loss of information. Each predictor input consists of 8 consecutive STFT vectors of the predictor signal, so each STFT output estimate is computed from the current STFT vector and the 7 previous ones.
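The feature extraction described above can be sketched with `scipy.signal.stft`. The parameter values (8 kHz, 128-sample Hamming window, 75% overlap, 8 frames per input) come from the text; the function names, the choice to disable boundary padding, and the decision to skip the first 7 frames that lack a full history are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import stft

FS = 8000        # sampling rate after downsampling (Hz)
WIN_LEN = 128    # STFT window length (samples)
OVERLAP = 96     # 75% of the window length
N_FRAMES = 8     # consecutive STFT vectors per predictor input


def magnitude_spectrogram(x):
    """One-sided STFT magnitude of a real signal, shape (65, n_frames)."""
    _, _, Z = stft(x, fs=FS, window="hamming", nperseg=WIN_LEN,
                   noverlap=OVERLAP, boundary=None, padded=False)
    return np.abs(Z)


def stack_predictors(mag, n_frames=N_FRAMES):
    """Group each frame with its 7 predecessors, shape (n_out, 65, 8).

    Assumption: the first 7 frames, which lack a full history, are skipped.
    """
    n_total = mag.shape[1]
    return np.stack([mag[:, i - n_frames + 1:i + 1]
                     for i in range(n_frames - 1, n_total)])
```

Each slice returned by `stack_predictors` is one 65×8 network input, and the matching target is the clean-speech magnitude vector of the last frame in that slice.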
First, a network composed of fully connected layers is used to extract the audio, with the input specified as an image of size 65×8. Two hidden fully connected layers are defined, each with 2,048 neurons. Because fully connected layers alone form a purely linear system, each hidden fully connected layer is followed by a rectified linear unit (ReLU) layer. Batch normalization layers normalize the mean and standard deviation of the outputs. A fully connected layer with 65 neurons is then added, followed by a regression layer. In the test set, the time-domain speech signals are reconstructed by the inverse short-time Fourier transform (ISTFT), using the phase of the STFT vectors of the vibration signals.
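The reconstruction step can be sketched as follows: the predicted magnitude spectrum is combined with the phase of the vibration-signal STFT and passed to `scipy.signal.istft`. The function name `reconstruct_speech` is hypothetical, and SciPy's default boundary handling is an assumption; the window settings follow the text.

```python
import numpy as np
from scipy.signal import stft, istft


def reconstruct_speech(pred_mag, vib_phase, fs=8000, nperseg=128, noverlap=96):
    """Combine a predicted magnitude spectrum with the vibration phase.

    pred_mag and vib_phase both have shape (65, n_frames). The STFT that
    produced vib_phase is assumed to use the same window settings and
    SciPy's default boundary padding.
    """
    Z = pred_mag * np.exp(1j * vib_phase)
    _, x = istft(Z, fs=fs, window="hamming", nperseg=nperseg,
                 noverlap=noverlap)
    return x


# Usage with a stand-in signal: with an ideal magnitude prediction and the
# matching phase, the round trip recovers the input.
rng = np.random.default_rng(0)
vibration = rng.standard_normal(8000)
_, _, Z_vib = stft(vibration, fs=8000, window="hamming",
                   nperseg=128, noverlap=96)
recovered = reconstruct_speech(np.abs(Z_vib), np.angle(Z_vib))
```

In the actual pipeline, `pred_mag` would be the network output rather than the magnitude of the vibration STFT, so the reconstruction quality also depends on how well the vibration phase approximates the speech phase.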
Next, convolutional layers are used instead of fully connected layers [10]. A 2D convolutional layer applies sliding filters to the input: it moves each filter vertically and horizontally along the input, computes the dot product of the weights and the input, and then adds a bias term. Convolutional layers typically have fewer parameters than fully connected layers. The fully convolutional network described in [10] is defined with 16 convolutional layers. The first 15 convolutional layers form groups of 3 layers, repeated 5 times, with filter widths of 9, 5, and 9 and filter numbers of 18, 30, and 8, respectively. The last convolutional layer has a filter width of 129 and one filter. In this network, convolution is performed in only one direction (along the frequency dimension), and the filter width along the time dimension is set to 1 for all layers except the first. As in the fully connected network, the convolutional layers are followed by ReLU and batch normalization layers.
The physical parameters of the rectangular plate are listed in Table 1. The plate is fixed on all four edges, and 10 × 5 elements are used in the FE model, as shown in Fig 2. The sound is assumed to be vertically incident, so the force acts evenly on the nodes; the influences of plate sound radiation and sound pressure are neglected.
When the node at (30, 20 mm) is selected as the vibration response point, as shown in Fig 2, the corresponding clean speech, vibration response, and extracted speech are obtained using fully connected layers and convolutional layers. The training process is shown in Figs 3 and 4. With fully connected layers, convergence is faster and the training time is shorter than with convolutional layers (3 min 17 s for fully connected layers versus 19 min 34 s for convolutional layers).
The extracted speech is shown in the time domain and as a spectrogram in Fig 5. The proposed method can extract and reconstruct the speech signals from the vibration response. Widely used objective evaluation indices of speech quality include the perceptual evaluation of speech quality (PESQ) [26], the MOS predictor of speech distortion (CSIG), the MOS predictor of the intrusiveness of background noise (CBAK), and the MOS predictor of overall processed speech quality (COVL) [27]. These objective indices correlate highly with subjective listening impressions and therefore measure speech quality well. PESQ is an objective speech quality index standardized by the International Telecommunication Union, with scores ranging from -0.5 to 4.5; the higher the score, the higher the speech quality. CSIG, CBAK, and COVL are composite objective indices that evaluate speech quality through a linear combination of other objective indices. Their scores range from 1 to 5; the higher the score, the better the speech quality.
CSIG assesses speech quality from the perspective of signal distortion. It is a linear weighting of PESQ, the log-likelihood ratio (LLR), and the weighted spectral slope (WSS) (Eq (22)). CBAK evaluates speech quality from the perspective of background noise interference. It is a linear weighting of PESQ, WSS, and the segmental signal-to-noise ratio (SegSNR) (Eq (23)). COVL reflects the overall quality of the signal and is also obtained by a linear weighting of PESQ, LLR, and WSS (Eq (24)). For the detailed calculation of LLR, WSS, and SegSNR, see Hu et al. [27].
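The linear weightings can be sketched as below. Eqs (22)-(24) are not reproduced in this text, so the coefficients used here are the ones commonly reported for the composite measures of Hu and Loizou and should be checked against [27]; the function name `composite_scores` and the clipping to the 1 to 5 range are assumptions for illustration.

```python
def composite_scores(pesq, llr, wss, seg_snr):
    """Return (CSIG, CBAK, COVL) from PESQ, LLR, WSS, and segmental SNR.

    Assumption: coefficients as commonly reported for the Hu and Loizou
    composite measures; verify against the cited reference before use.
    """
    def clip(score):
        # Composite scores are reported on a 1 to 5 scale.
        return min(max(score, 1.0), 5.0)

    csig = 3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wss
    cbak = 1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * seg_snr
    covl = 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss
    return clip(csig), clip(cbak), clip(covl)
```

The signs show the intended direction of each term: a higher PESQ or segmental SNR raises the score, while larger LLR and WSS distances lower it.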
The objective evaluation of the speech extracted with the different networks is shown in Table 2. The speech extracted with fully connected layers is of better quality than that extracted with convolutional layers. Because the fully connected network converges faster and yields better extracted speech, it is used to analyze the influence of other factors in the remainder of this paper.

Node position
This section examines the influence of the response node position on speech extraction, because the vibration response depends on the position on the structure. Owing to the symmetry of the structure, only the nodes in one quarter of the structure are selected. Each node is trained separately, and the objective evaluation indices of the extracted speech are compared (Table 3). The speech extraction results are good for all nodes.
Because the objective evaluation indices of speech quality are correlated, PESQ, CSIG, CBAK, and COVL were all used as observations in an analysis of variance (ANOVA). Taking into account the sample selection error in the training process, the ANOVA results indicate no significant difference (P > 0.05) in the speech extraction performance of different nodes (Tables 4 and 5).
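A one-way ANOVA of this kind can be sketched with `scipy.stats.f_oneway`. The source does not describe the exact ANOVA design, so treating each node as one group of repeated scores is an assumption, as are the function name `nodes_differ` and the illustrative numbers, which are not results from the paper.

```python
from scipy.stats import f_oneway


def nodes_differ(scores_by_node, alpha=0.05):
    """One-way ANOVA over per-node score samples.

    scores_by_node is a list of sequences, one per node, each holding
    repeated scores of a single index (for example PESQ).
    Returns (F statistic, p-value, True if the difference is significant).
    """
    f_stat, p_value = f_oneway(*scores_by_node)
    return float(f_stat), float(p_value), bool(p_value < alpha)


# Illustrative values only (hypothetical, not taken from Tables 3 to 5).
same = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
different = [[1.0, 1.1, 0.9], [5.0, 5.1, 4.9]]
_, p_same, sig_same = nodes_differ(same)
_, p_diff, sig_diff = nodes_differ(different)
```

A p-value above 0.05, as reported in the text, means the variation between nodes is not distinguishable from the variation caused by sample selection during training.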

Noises
In actual situations, speech signals and vibration response signals are inevitably disturbed by noise. Without loss of generality, the node at (30, 20 mm) is selected as the research object in this section. Table 6 shows the objective evaluation of the extracted speech quality when white noise with signal-to-noise ratios (SNRs) of 5, 0, and -5 dB is added to the clean speech signals. Increasing the noise in the speech signals reduces the quality of the extracted speech. The collected vibration response signal may also contain noise from the external environment or the sensors. Table 7 shows the objective evaluation of the extracted speech quality when white noise at 5, 0, and -5 dB SNR is added to the vibration response signals alone, with clean speech. As with noisy speech, noise in the vibration response signal alone reduces the speech extraction quality as the noise increases. A comparison with Table 6 shows that its impact is smaller than that of noisy speech.
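Adding white noise at a prescribed SNR can be sketched as follows. The source does not say how the noise was generated or scaled, so Gaussian white noise scaled against the average power of the whole signal is an assumption, and `add_white_noise` is an illustrative name.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng=None):
    """Return signal plus Gaussian white noise at the requested SNR (dB).

    Assumption: SNR is defined from the mean power of the whole signal,
    SNR = 10 * log10(P_signal / P_noise).
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(signal.shape)
    signal_power = np.mean(signal**2)
    noise_power = np.mean(noise**2)
    # Scale the noise so that the realized power ratio matches snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise
```

Calling the function with `snr_db` set to 5, 0, and -5 on either the speech or the vibration response would produce the three noise levels considered in this section.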
Fig 7 shows the time-domain signal and spectrogram of the speech extracted when noise is added to the vibration response signals with SNR = 0 dB. The impact of noise on the vibration signals alone is smaller than that of noise on the speech alone. The noise of the vibration response will be amplified by the speech signal through the acoustic vibration system, which reduces the quality of the extracted speech, as a comparison with Fig 6(B) shows.
Table 8 shows the objective evaluation of the extracted speech quality when 5-dB white noise is added to the speech signals and 10-, 5-, and 0-dB white noise is added to the vibration response signals. The addition of composite noise causes a sharp decrease in extraction quality. Although the method retains some speech extraction capability, its extraction quality needs to be further enhanced.

Location deviation
In practical applications, the location of the vibration response sensor may change because of sensor sliding or positioning deviations of a laser vibration sensor, which results in inconsistencies between the sensor location during training and during prediction. This section discusses the impact of location deviation on the prediction system. The node at (30, 20 mm) is still selected as the vibration response node during training, and other node locations are selected for comparison during testing. The quality of the extracted speech is axially symmetric with respect to the geometric center line of the plate (Table 9) because the structure, boundary conditions, and excitation are all axially symmetric. Structural modal analysis shows that the vibration modes are also symmetric, with the same amplitude. As a result, the speech quality at symmetric points in the test set is symmetric. The closer the test point is to the center of symmetry, the better the extracted speech quality; it is even better than when the training point and the prediction point coincide. The extracted speech quality decreases to some extent at test points close to the boundary of the structure. In conclusion, location deviation within a certain range of the training point does not have much impact on speech quality, which makes practical engineering applications feasible.

Added mass
In practical applications of the plate structure, impurities or changes in sensor mass may alter the mass matrix of the structure. This section examines the sensitivity of the extracted speech quality to changes in the test node mass when the prediction model is left unchanged. As before, the node at (30, 20 mm) is selected as the test node. The node mass remains unchanged during training but is gradually increased during testing. Table 10 shows the objective evaluations of the extracted speech quality when the mass of the test node is increased by 1, 5, and 10 times the element mass. The extracted speech quality decreases as the node mass increases, but the impact is slight, so the method has good robustness to added mass.

Boundary conditions
The boundary conditions of the plate may change in actual situations, for example through boundary loosening. It is assumed in this section that the boundary nodes at (100, 10), (100, 20), (100, 30), and (100, 40) change from clamped to free. The vibration response of the node at (30, 20 mm) under the condition of clamped support on all four sides is taken as the training set. Table 11 shows the objective evaluation of the speech extracted at 3 different nodes: good results are obtained at the different test points when the boundary changes in the test set. The speech quality is no longer symmetric. The closer the test point is to the free boundary, the better the speech extraction quality. The lower the structural flexibility, the better the speech

Conclusion
Increased node mass during testing reduced the extracted speech quality, but the effect was small. Meanwhile, the extracted speech quality did not change significantly when the boundary conditions changed during testing, and location deviation did not have much impact. In addition, the lower the structural flexibility, the better the speech extraction quality. The above analysis shows that the proposed speech extraction model is robust to location deviation, mass deviation, and changes in boundary conditions. It has advantages and better engineering application prospects compared with traditional pattern-recognition methods. In the next step, the effectiveness of the proposed method will be verified by experiments, and the training model will be further optimized to reduce the influence of speech and vibration signal noise on the extracted speech quality.
Fig 1 shows the framework. It is assumed that sound waves are vertically incident on a flexible plate, and the vibration response of the plate is collected by a sensor. The predictor signals and the network target signals are the amplitude spectra of the vibration response signals and of the clean audio signals, respectively. The output of the network is the amplitude spectrum of the extracted speech signal. The regression network uses the predictor input to minimize the mean square error between its output and the target. The amplitude spectra of the output and the phase of the vibration signals are used to convert the extracted audio back to the time domain.

Fig 3. Training process using fully connected layers. https://doi.org/10.1371/journal.pone.0288847.g003

Fig 4 shows the time-domain signal and spectrogram of the extracted speech when the SNR equals 0 dB. Noise in the speech signals strongly affects the vibration response signals, which degrades the extraction of clean speech, as a comparison of Figs 5(B) and 6(B) shows. However, clean speech can still be extracted, which shows good noise robustness.

Fig 8. (a) Noisy speech (SNR = 5 dB); (b) Vibration response with noise (SNR = 5 dB); (c) Extracted speech. Note: Left: time domain; Right: spectrogram. https://doi.org/10.1371/journal.pone.0288847.g008

Vibration signals of the acoustic system were collected in this work for deep learning training to obtain speech signals. The fully connected network converged faster and produced better extracted speech quality than the convolutional network. The simulation results showed that good speech quality was obtained in many cases. The location of the vibration response point during training had little effect on the speech extraction quality. Noise in the speech signals had a greater influence on the quality of speech extraction than noise in the vibration signals. The extraction quality was poor when both were very noisy, but the method still retained some extraction ability. The quality of speech extraction was little affected by deviations of the vibration response location between training and testing, and the method was robust within a certain range. The speech extraction quality was also symmetric. The lower the structural flexibility, the better the speech extraction quality.

Table 10. Changes in node mass: objective evaluation of extracted speech. https://doi.org/10.1371/journal.pone.0288847.t010

extraction quality combined with Section 3.3. The speech extraction model in the work is also insensitive to boundary changes.